feat(eval): seed hand-graded golden cases so the benchmark can run (GEPA) #660
Open
Victor "David" Medina (Victor-David-Medina) wants to merge 1 commit into
…n (GEPA)

Check-before-build found the golden set ALREADY EXISTS (106 hand-graded cases: GOLDEN_CASES_V1 + GOLDEN_CASES_RECOVERY_V1, covering the 4 revenue workflows + morning-brief/churn/isa), so building new golden cases would be pure duplication.

The REAL gap: those cases live only as static TS constants, imported by nothing but the barrel. The eval/PGR benchmark reads from the evaluation_golden_cases Supabase table, and run-first-eval.ts bails with 'No active golden cases found. Seed cases first.' The graded content was orphaned from the runtime path - the benchmark could not run.

This bridges them: lib/eval/golden-seed.ts (pure source + DB-row mapper) + scripts/seed-golden-cases.ts (idempotent upsert-by-title, --dry-run, founder-gated on prod). Now seed once -> run-first-eval establishes PGR baselines -> the 'we grade ourselves' benchmark is live.

Honest at proof=0: golden cases are representative fixtures grading draft QUALITY, not recovered dollars.

Tests (pure, no DB): combined set complete + unique titles (the dedup key) + covers the graded workflows + maps cleanly to the insert row.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
🛡️ Cascade Quality Score: 100/100
Threshold: 85/100 | Result: PASS ✅
GEPA benchmark: seed the hand-graded golden cases so it can actually run
Check-before-build first. The scoped task was "build the GEPA golden-set (recovery-draft + morning-brief)." Reading the eval system first showed that would be 100% duplication - the golden set already exists:
- `lib/eval/golden-cases-recovery-v1.ts` - 56 graded cases across all 4 revenue workflows (lapsed-winback, estimate-recovery, review-lift, slot-rescue), each with an ideal `reference_verdict` + graded dimensions.
- `lib/eval/golden-cases-v1.ts` - 50 graded cases including morning-brief, churn-prediction, isa-routing, content, council.

The real gap (a genuine "non-working" section)

Those 106 cases are imported by nothing except the `lib/eval` barrel. The eval/PGR benchmark reads golden cases from the `evaluation_golden_cases` Supabase table (`golden-dataset.ts`), and `scripts/run-first-eval.ts` bails immediately with 'No active golden cases found. Seed cases first.'

There is no seed that bridges the static constants into that table. The graded content was orphaned from the runtime path - the benchmark could not run at all.
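The failure mode described above can be sketched roughly as follows. This is an illustration only, not the code in `scripts/run-first-eval.ts`: the `GoldenCaseRow` shape, the `is_active` column, and the `assertSeeded` helper are hypothetical, and only the error message is taken from this PR.

```typescript
// Hypothetical row shape; the real evaluation_golden_cases columns may differ.
interface GoldenCaseRow {
  title: string;
  is_active: boolean; // assumed flag for "active" cases
}

// Returns the active cases, or throws with the bail-out message when the
// table has nothing usable (for example, because it was never seeded).
function assertSeeded(rows: GoldenCaseRow[]): GoldenCaseRow[] {
  const active = rows.filter((row) => row.is_active);
  if (active.length === 0) {
    throw new Error("No active golden cases found. Seed cases first.");
  }
  return active;
}
```

With an empty table the guard throws before any grading happens, which is why the static constants alone could not make the benchmark run.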
The fix (wire it live, don't rebuild)
- `lib/eval/golden-seed.ts` - pure source: `getSeedGoldenCases()` (combines both sets) + `toGoldenCaseInsert()` (maps a static `GoldenCase` to a DB row; drops the client string `id` since the table auto-generates a UUID, drops timestamps). No IO, unit-testable.
- `scripts/seed-golden-cases.ts` - idempotent seed: upsert-by-title (safe to re-run, never deletes), `--dry-run` preview, founder-gated on prod (writes to Supabase).
- `__tests__/golden-seed.test.ts` - pure (no DB): combined set complete + unique titles (the dedup key) + covers the graded workflows including morning-brief + maps cleanly to the insert row + every case is actually graded (not schema-only).

After merge: `npx tsx scripts/seed-golden-cases.ts` → `npx tsx scripts/run-first-eval.ts` → the "we grade ourselves" PGR benchmark is live. Honest at `proof_events=0`: golden cases are representative fixtures grading draft quality, not recovered dollars.

Generated with Claude Code by RelayLaunch
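For reviewers who want a feel for the mapper and the upsert-by-title behavior described in the fix, here is a minimal sketch under stated assumptions. The `GoldenCase` fields shown are a hypothetical subset, and `planSeed` and `findDuplicateTitles` are illustrative helpers that do not appear in this PR; only the name `toGoldenCaseInsert` and the "drop `id` and timestamps, dedup by title" behavior come from the description.

```typescript
// Hypothetical subset of the static case shape; the real type has more fields.
interface GoldenCase {
  id: string;                // client-side string id, not stored
  title: string;             // dedup key for the seed
  workflow: string;
  reference_verdict: string; // value format is an assumption
  created_at?: string;
  updated_at?: string;
}

// Row sent to the table: no id (the table generates a UUID), no timestamps.
type GoldenCaseInsert = Omit<GoldenCase, "id" | "created_at" | "updated_at">;

function toGoldenCaseInsert(c: GoldenCase): GoldenCaseInsert {
  return {
    title: c.title,
    workflow: c.workflow,
    reference_verdict: c.reference_verdict,
  };
}

interface SeedPlan {
  toInsert: GoldenCaseInsert[];
  toUpdate: GoldenCaseInsert[];
}

// Pure planning step: titles already in the table are updated, new titles
// are inserted, and nothing is ever deleted, so re-running is safe.
function planSeed(cases: GoldenCase[], existingTitles: Set<string>): SeedPlan {
  const plan: SeedPlan = { toInsert: [], toUpdate: [] };
  for (const c of cases) {
    const row = toGoldenCaseInsert(c);
    if (existingTitles.has(c.title)) {
      plan.toUpdate.push(row);
    } else {
      plan.toInsert.push(row);
    }
  }
  return plan;
}

// Titles must be unique because they are the dedup key.
function findDuplicateTitles(cases: GoldenCase[]): string[] {
  const seen = new Set<string>();
  const dupes = new Set<string>();
  for (const c of cases) {
    if (seen.has(c.title)) dupes.add(c.title);
    seen.add(c.title);
  }
  return Array.from(dupes);
}
```

Keeping the plan pure means a dry run can print `toInsert` and `toUpdate` without touching the database, and the same function can be unit-tested with no DB.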